About Okapi BM25

Okapi BM25: English Wikipedia edition

Okapi BM25
In information retrieval, Okapi BM25 (BM stands for Best Matching) is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.
The name of the actual ranking function is BM25. To set the right context, however, it is usually referred to as "Okapi BM25", since the Okapi information retrieval system, implemented at London's City University in the 1980s and 1990s, was the first system to implement this function.
BM25, and its newer variants, e.g. BM25F (a version of BM25 that can take document structure and anchor text into account), represent state-of-the-art TF-IDF-like retrieval functions used in document retrieval, such as web search.
== The ranking function ==

BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters. One of the most prominent instantiations of the function is as follows.
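Given a query Q, containing keywords q_1, ..., q_n, the BM25 score of a document D is:
: \text{score}(D,Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)},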

where f(q_i, D) is q_i's term frequency in the document D, |D| is the length of the document D in words, and avgdl is the average document length in the text collection from which documents are drawn. k_1 and b are free parameters, usually chosen, in the absence of an advanced optimization, as k_1 \in [1.2, 2.0] and b = 0.75.〔Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. ''An Introduction to Information Retrieval'', Cambridge University Press, 2009, p. 233.〕
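
To get a feel for what k_1 and b do, the term-frequency factor of the score can be evaluated on its own. The following is a minimal Python sketch, not a reference implementation; the function name tf_component and the sample document lengths are made up for illustration, and the defaults k1=1.2 and b=0.75 are just the commonly quoted values.

```python
def tf_component(f, doc_len, avgdl, k1=1.2, b=0.75):
    """Term-frequency factor of BM25 for one query term in one document.

    f       -- term frequency f(q_i, D)
    doc_len -- document length |D| in words
    avgdl   -- average document length in the collection
    """
    # Length normalization: equals 1 when the document has average length.
    norm = 1.0 - b + b * (doc_len / avgdl)
    return f * (k1 + 1.0) / (f + k1 * norm)


# Illustrative values only: an average-length document (100 words).
# The factor grows with f but levels off below k1 + 1.
for f in (1, 2, 5, 50):
    print(f, round(tf_component(f, doc_len=100, avgdl=100), 3))

# The same term frequency counts for less in a longer-than-average document.
print(tf_component(3, doc_len=400, avgdl=100) < tf_component(3, doc_len=100, avgdl=100))
```

With b = 0 the document length has no effect on the factor, and with b = 1 the term frequency is fully scaled by |D| / avgdl.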

IDF(q_i) is the IDF (inverse document frequency) weight of the query term q_i. It is usually computed as:
: \text{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5},
where N is the total number of documents in the collection, and n(q_i) is the number of documents containing q_i.
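
Putting the score and the IDF weight together, the whole ranking function fits in a few lines. The sketch below is one possible Python rendering under simplifying assumptions: documents are already tokenized into lists of words, the collection is non-empty, and the function name bm25_scores and the toy corpus are invented for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one BM25 score per tokenized document in `docs`."""
    n_docs = len(docs)                              # N
    avgdl = sum(len(d) for d in docs) / n_docs      # average document length
    # n(q): number of documents containing each distinct query term
    doc_freq = {q: sum(1 for d in docs if q in d) for q in set(query_terms)}

    scores = []
    for d in docs:
        counts = Counter(d)
        norm = 1.0 - b + b * len(d) / avgdl
        score = 0.0
        for q in query_terms:
            f = counts[q]                           # f(q, D)
            if f == 0:
                continue                            # absent terms contribute 0
            idf = math.log((n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5))
            score += idf * f * (k1 + 1.0) / (f + k1 * norm)
        scores.append(score)
    return scores


# Toy corpus, made up for illustration.
docs = [
    "the okapi is a forest mammal".split(),
    "bm25 is a ranking function".split(),
    "the ranking of documents by relevance".split(),
    "search engines rank documents".split(),
    "a query has keywords".split(),
]
query = ["ranking", "function"]
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, scores)
```

Because the function is a bag-of-words model, only the per-document term counts matter; the order of the words inside each list plays no role.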
There are several interpretations for IDF and slight variations on its formula. In the original BM25 derivation, the IDF component is derived from the Binary Independence Model.
Note that the above IDF formula has a potentially major drawback for terms appearing in more than half of the corpus documents: their IDF is negative. For any two almost-identical documents, one containing such a term and one not, the document without the term may receive the higher score.
Terms appearing in more than half of the corpus therefore make negative contributions to the final document score. This is often undesirable, so many real-world applications handle the IDF formula differently:
* Each summand can be given a floor of 0, to trim out common terms;
* The IDF function can be given a floor of a constant \epsilon, to avoid common terms being ignored altogether;
* The IDF function can be replaced with a similarly shaped one which is non-negative, or strictly positive to avoid terms being ignored altogether.
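
The three workarounds listed above can be compared side by side. The Python sketch below is illustrative only: the function names, the value eps=0.25, and the particular replacement log(1 + ...) used for the third option are assumptions chosen for this example, not prescribed by the text.

```python
import math


def idf_raw(n_docs, doc_freq):
    """IDF as given above; negative when doc_freq > n_docs / 2."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_floor_zero(n_docs, doc_freq):
    """Floor of 0. Since the term-frequency factor is never negative,
    flooring the IDF at 0 has the same effect as flooring each summand."""
    return max(0.0, idf_raw(n_docs, doc_freq))


def idf_floor_eps(n_docs, doc_freq, eps=0.25):
    """Floor of a small constant eps (hypothetical value), so that
    common terms still contribute a little."""
    return max(eps, idf_raw(n_docs, doc_freq))


def idf_positive(n_docs, doc_freq):
    """One similarly shaped, strictly positive replacement (an assumption
    here): adding 1 inside the logarithm keeps its argument above 1."""
    return math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# A term found in 8 of 10 documents versus a term found in 1 of 10.
for df in (8, 1):
    print(df, idf_raw(10, df), idf_floor_zero(10, df),
          idf_floor_eps(10, df), idf_positive(10, df))
```

For rare terms the first two variants leave the original IDF unchanged; only terms in more than half of the documents are affected.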

Excerpt source: Wikipedia, the free encyclopedia